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Abstract 

Block and global Krylov subspace methods have been proposed as methods 
adapted to the situation where one iteratively solves systems with the same 
matrix and several right hand sides. These methods are advantageous, since they 
allow to cast the major part of the arithmetic in terms of matrix-block vector 
products, and since, in the block case, they take their iterates from a potentially 
richer subspace. In this paper we consider the most established Krylov subspace 
methods which rely on short recurrencies, i.e. BiCG, QMR and BiCGStab. We 
propose modifications of their block variants which increase numerical stability, 
thus at least partly curing a problem previously observed by several authors. 
Moreover, we develop modifications of the “global” variants which almost halve 
the number of matrix-vector multiplications. We present a discussion as well as 
numerical evidence which both indicate that the additional work present in the 
block methods can be substantial, and that the new “economic” versions of the 
“global” BiCG and QMR method can be considered as good alternatives to the 
BiCGStab variants. 
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1. Introduction 

We consider simultaneous linear systems with the same matrix A G C"^" 
and s right-hand sides (r.h.s.) bi, 

Axi = bi, i = 1,..., s, (1) 

which we also formulate in block form as the matrix equation 

AX = B, where A = [a:i|... \xs],B = [5i|... |6,] G 
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As a general rule we will use lower case letters with sub-indices for the columns 
of the matrix denoted with the corresponding upper case letter. 

Our overall assumption is that A is large and sparse and that a (precon¬ 
ditioned) Krylov subspace type iterative method is to be used to iteratively 
approximate the solutions for the different r.h.s. Our interest is in methods 
which efficiently take advantage of the fact that we are given the s linear sys¬ 
tems with the r.h.s. bi,i = 1, ■ ■ ■, s simultaneously, and we use the generic name 
simultaneous methods for any such method. Investigations in this setting are 
not new, the development and analysis of block variants of the standard Krylov 
subspace methods dating back to at least the mid 1980s [53]. With the ad¬ 
vent of massively parallel processor architectures such as GPUs, where memory 
access is usually the determining factor for computational speed, simultaneous 
linear systems of the form Q offer the possibility of optimizing memory access 
to A by reading and storing A only once while updating the iterates for all s 
r.h.s. simultaneously. Algorithms which take this into account can therefore be 
expected to run faster because their arithmetic operations perform faster. More¬ 
over, simultaneous methods also offer the potential to, in addition, increase the 
speed of convergence when obtaining their iterates from a larger, “block” Krylov 
subspace rather than the standard Krylov subspaces. 

As a rule of thumb, working with s several r.h.s. simultaneously instead of 
just one increases the number of vectors to be stored by a factor of s. This is why 
simultaneous variants of the GMRES method may, due to the long recurrences 
present in GMRES, suffer from the fact that the method has to be restarted 
after a relatively small number of iterations. This typically degrades the speed 
of convergence. 

In this paper we therefore focus on simultaneous Krylov type methods relying 
on short recurrences. We assume that A is a general non-singular matrix, i.e. 
we do not assume that we can take advantage of additional properties of A like 
symmetry, e.g. Typical and well-established methods for a single right-hand 
side in this case are BiCG (bi-conjugate gradients [IQ]), BiCGStab (stabilized 
BiGG [35]) and QMR (quasi-minimal residual method [H]), and we mainly 
develop on three different simultaneous variants for each of these three methods 
in this paper. In section [^ we introduce the three different principles termed 
loop-interchanging^ global and block to generate a simultaneous variant from a 
given (single r.h.s.) Krylov subspace method and, as an example, describe the 
resulting BiGG variants in some detail. We then devote section[^to a discussion 
of the various simultaneous methods known from the literature in our context 
before developing two improvements in section [^ One improvement reduces 
the arithmetic cost in the global BiGG and QMR variants by nearly a factor 
of 2, suppressing almost all of the matrix-vector multiplications with A^. The 
other improvement enhances numerical stability in the block methods through 
the judicious use of QR-factorizations of the matrix of simultaneous residuals. 
Section [^ contains an extensive numerical study for the various methods and 
conclusions on which methods are to be preferred in which situations. 
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2. Loop-interchanging, global and block Krylov methods 

In this section we describe three different approaches to obtain a simultane¬ 
ous Krylov subspace method for s r.h.s. from a given method for a single r.h.s. 
All resulting simultaneous methods reduce to the single r.h.s. method for s = 1. 

Solving the systems Q one after the other with the same Krylov subspace 
method can be viewed as performing two nested loops with the outer loop ( “for 
i = 1, ..., s”) running over the different r.h.s. and the inner loop (“for fc = 1,... 
until convergence”) performing the iteration of the Krylov subspace method. 
These two loops can be interchanged, resulting in an algorithm where each 
sweep through the inner loop can be organized in a way such that all occurring 
matrix-vector multiplications for all i are gathered into one matrix-block-vector 
multiplication, the term “block vector” denoting a matrix of size n x s. For 
BiCG as the underlying single r.h.s. Krylov subspace method, Algorithmj^states 
the resulting Li-BiCG method, where Li stands for “loop interchanged”. Note 
that whenever possible we cast inner loop operations as operations with n x s 
matrices with, e.g., column of G ^nxs BiGG iterate for the 

r.h.s. bi, etc. The inner loop index appears explicitly only when calculating the 
f-th diagonal entries and of the sx s diagonal matrices diag(a^*^^) and 
diag(/3*'^^), respectively. All multiplications with the matrices A and appear 
as matrix-block-vector multiplications AP^^'> and A^. If the nominator in 
the definition of or is zero, the iteration for r.h.s. hi breaks down. 


Algorithm 1: Loop-interchanged BiGG (Li-BiGG) 

c 

P 

f( 

hoose G 

ut =B- AX^°\ 

or fc = 0,1, 2,... until convergence do 

Q(k) _ ^p(fc) 

^(fe) ^ z = 1,.. .,s 

^(fe+i) ^ ^(fc) p(fe)diag(a('=)) 

p(/c+i) ^ p(fe) _ Q(fe)diag(a('=)), - (A^p('=))diag(a(G) 

V(rf \ ^ = 1,..., 5 

p(k+i) ^ p(fc+i) p(fe)diag(/?('=)), -p pWdiag(/3W) 


Global methods take a different approach. They cast the s systems Q into 
one big, tensorized sn x sn system 



■ hi - 


Xi 

(/ (g) A)x = &, where b = 

. bs _ 

, X = 

Xs 


and then apply a standard Krylov subspace method to the system (§. Using 
the matmxs operator to cast vectors x of length ns into matrices X of size nx s 
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“by columns”, i.e. {X)ij = and the identities 

matnxsiil <8) A)x) = AX, {x,y) = tT(Y^X) where 
X = ma,tnxs{x),Y = mat„xs(2/), 

we end up with the formulation of BiCG for ([^ given in Algorithm known as 
global BiCG (Gl-BiCG), see [H]. The algorithm breaks down if the nominator 
in or is zero. 


Algorithm 2: Global BiCG (Gl-BiCG) 

c 

P 

f< 

hoose G 

ut AX(°\ P(0) = 

or fc = 0,1, 2,... until convergence do 

Qik) ^ ^p(fc) 

= tr((pW)^pW)/tr((P('=))^Q('=)) 

Xik+l) — ^(fc) _|_ Q,(/c) p(k) 

Rik+1) ^ p(fe) _ a^k)Q(k)^ p(k+i) ^ p(k) _ ^(^A^pik)^ 

= tr((p('=+i))^p('=+i))/tr((pW)'f^pW) 

p(fc-l-l) _ p(.k+l) _|_ p(k) p{k) ^ p(k+l) _ p<,k+l) _|_ p(k) p(k) 


(1) 

In the loop interchange and the global variant of BiCG the ith columns 
for £ = 0,..., fc — 1 represent a basis for the Krylov subspace 

K.k{A,rf^) = span{r,^°\ 

and in both methods the iterate x^^\ the ith column of X^^\ is taken from 
x‘f'^ + Block Krylov subspace methods take advantage of the fact 

that the Krylov subspaces K,k{A, = 1,..., s are available simultaneously 

and take each iterate xi from the larger “block” Krylov subspace 

S 

■.= Y,Xk{A,rf^), 

i=l 

which has dimension sk (or less if linear dependence occurs “across” the single 
r.h.s. Krylov subspaces Xk{A, The usual approach is to transfer the vari¬ 

ational characterization for the iterates of the single r.h.s. method to the block 
Krylov subspace. In this manner, the block BiCG method from defines its it¬ 
erates (they appear again as the columns of an n x s block vector) by requir¬ 
ing that the iterates are from +Xk{A,R^^^) and their residuals are or¬ 
thogonal to Xk{A^, for some “shadow residual” block vector S 
The resulting method Bl-BiCG (block BiGG), is given as Algorithm]^ Note that 
now the quantities and are s x s-matrices, and the method 

breaks down prematurely if one of the matrices A^RX) is 

singular. 
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Algorithm 3: Block BiCG (Bl-BiCG) 

c 

P 

f( 

hoose e 

ut p(o) ^ AX(°\ P(0) = P(0),P(°) = P(0) 

or fc = 0,1, 2,... until convergence do 

Q{k) ^ ^p(fc), Q(k) = ^J/p(fc) 

^(k) ^ ((p(fc))ffQ(C)-i((p(C)J^p(fe)) 
a^k) = ((p(C)ffQ(C)-i((p(C)J^p(fe)) 

^(fe-l-1) _ (fc) _|_ p(fe)Q,(fe) 

Rik+1) ^ p(fc) _ Q(fc)Q,(fc)^ fi(k+i) ^ p(fc) _ g(fc)^(fc) 

^(fe) ^ ((pW)^^p(fe))-i((p(fc+i))^^p(fc+i)) 

/3(C = ((p(G)^^p(fe))-i((p(fc+i))^^p(fc+i)) 

p(fc-l-l) _ p(fc+l) _|_ p(fc)y3(fc)^ p(fc+l) = -p p(Cy3(fc) 


Apart from the matrix-vector multiplications, the work to update the var¬ 
ious quantities is substantially higher in block BiCG as compared to loop in¬ 
terchanged BiCG and global BiCG. Gounting an inner product or a SAXPY 
operation with vectors of length n as one vector operation (representing n ad¬ 
ditions and n multiplications), we have the following proposition, the proof of 
which also gives an indication on how to save arithmetic work by re-using in¬ 
formation from the previous iteration. 

Proposition 1. One iteration of Li-BiCG, Gl-BiCG or Bl-BiCG requires two 
matrix-block-vector multiplications (one with A and one with ) with block- 
vectors of size n X s plus 7s additional vector operations for Li-BiCG as well as 
Gl-BiCG, and 7s^ additional vector operations for Bl-BiCG. 

Proof. We only have to care about operations other than the multiplications 

with A and A^. In Li-BiCG, assuming that we save for re-use in 

the next iteration, we need a total of 2s inner products to compute all 

and s SAXPYs for each of p(fc+i), p(fe+i). Exactly 

the same count holds for Gl-BiCG. In Bl-BiGG we can save 

for use in the next iteration, and we can exploit the fact that the two factors 

defining are just the adjoints of those defining and similarly for 

and So we need just 2s^ inner products to build these factors, and we 

neglect the cost 0{s^) for multiplying s x s matrices. The computation of each 

of p(fc+i), p(fc+i) requires s^ SAXPYs. 

We note that the updates of p('=+i), p(*^+i) in the 

block method actually represent BLASS GEMM operations, see [5] which have 
a more favorable ratio of memory access to arithmetic work than SAXPY op¬ 
erations, so the overhead of a factor of s of the block method vs. the loop 
interchange and the global method suggested by Proposition may show less 
in actual timings. 
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In a similar way one obtains the three simultaneous variants Li-QMR, Gl- 
QMR and Bl-QMR of the QMR method. The variational characterization in 
Bl-QMR is that the 2-norm of the coefficients which represent the residual from 
Ki;{A, in the nested bi-orthogonal basis of Ki~{A, with respect to 

Ki;{A^, be minimal; see m- We do not write down the resulting al¬ 

gorithms explicitly nor do we so for the BiCGStab variants Li-BiCGStab, Gl- 
BiGGStab and Bl-BiGGStab. It is worth mentioning, though, that BiCGStab 
does not have a proper variational characterization, its main property being 
that multiplications with A^ present in BiCG are replaced by those with A, 
thus obtaining iterate from -I- while imposing a steepest 

descent property on the residual as an intermediate step. The block BiCGStab 
method from m transfers this approach to the multiple r.h.s. case such that 
the fc-th iterate is from + IC 2 k{A, R^^'>), imposing a condition on the 
residuals in the intermediate steps which, interestingly, is quite in the spirit of 
a “global” steepest descent method. We refer to m for details. 

3. Discussion of the literature 

The loop interchange approach is most probably not new, although we are 
not aware of a systematic discussion as a construction principle for simultaneous 
methods. 

Global methods were considered in a variety of papers. Surprisingly, al¬ 
though they all just realize the approach to perform the respective single r.h.s. 
method for (§ and then “matricize” all operations with vectors of length ns, we 
could not find this fact mentioned explicitly in the literature. The first global 
methods are global full orthogonalization (Gl-FOM) and global generalized min¬ 
imal residual (Gl-GMRES), introduced in [2^. Gl-BiCG and Gl-BiCGStab 
were suggested in [2T] , and global variants of less well-known Krylov subspace 
methods were subsequently proposed in [TS] (Gl-CMRH), [33] (Gl-CGS), [13] 
(Gl-SCD) and [3S] (Gl-BiCR and its variants). 

Block Krylov subspace methods were introduced in |33|, where Bl-BiGG was 
considered together with block conjugate gradients (Bl-GG). The (long recur¬ 
rence) block generalized minimal residual (Bl-GMRES) algorithm goes back to 
m and |33|, and a convergence analysis can be found in |35|. Block Krylov 
subspace methods require additional work for factorizations and multiplications 
of n X s or s X s matrices in each iteration (cf. Proposition . Very often, 
this substantially slows down the overall process. The hybrid method from [33] 
describes an approach, where this additional work is reduced by adding cycles 
in which a (matrix valued) polynomial obtained from the block Arnold! process 
of a previous cycle is used. The additional cycles save arithmetic cost since they 
do not perform the block Arnold! process. 

The Bl-QMR method was suggested in [TT], Bl-BiGGStab goes back to [IH] . 
and Bl-LSQR, a block least squares QR-algorithm to solve the normal equations 
for simultaneous r.h.s. was given in |22| . The block methods relying on the non- 
symmetric Lanczos process can suffer from the fact that this process can break 
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down prematurely if the bi-orthogonality condition imposed on the to be com¬ 
puted block basis vectors can not be fulfilled. Numerically, this will typically 
result in very ill-conditioned s x s matrices in Bl-BiCG and Bl-BiCGStab for 
which linear systems have to be solved in the algorithm. The non-symmetric 
Lanczos process at the basis of Bl-QMR from [T] does, in principle, not suffer 
from this phenomenon since it incorporates a look-ahead strategy which cures 
almost all possible unwanted breakdowns!^ Both, the block Arnold! process at 
the basis of Bl-GMRES as well as the symmetric or non-symmetric Lanczos pro¬ 
cess at the basis of Bl-GG or Bl-BiGG, Bl-QMR and Bl-BiGGStab, respectively, 
should account for an additional danger of producing ill-conditioned s x s sys¬ 
tems arising because of deflation. Occurrence of deflation within a block Krylov 
subspace method means that while the block Krylov subspaces ICk{A, or 
K,k{A^, has dimension ks, the next, K,k+i{A, or K.k+i{A^, has 

dimension less than {k + l)s. Similar reductions in dimension might occur again 
in later iterations. In principle, deflation is advantageous, since it allows to 
reduce the number of matrix-vector operations per iteration. However, algo¬ 
rithms have to be adjusted, and additional book-keeping is necessary. Deflation 
in the unsymmetric block Lanczos process and the block Arnold! process was 
considered in |4], the consequences for Bl-GMRES are discussed in detail in 
[T7] . Bl-QMR from [Tl] and the Bl-GG variant considered in [3] address defla¬ 
tion by indeed explicitly reducing the dimension of the block Krylov subspaces. 
Interestingly, the latter two methods now work by applying the matrix-vector 
multiplications one at a time instead of simultaneously to a block vector, check¬ 
ing for deflation after each such operation. These methods can thus no longer 
take advantage of a possible performance gain due to multiplications of matri¬ 
ces with block-vectors. This is in contrast to the modifications of Bl-BiGG and 
Bl-GG from and [S] which “hide” the possible singularity of the sx s matri¬ 
ces by using QR-decompositions and modified update formulae. This approach 
to treat deflation implicitly does not allow to save matrix-vector multiplications 
when deflation occurs, but keeps the advantage of relying on matrix-block-vector 
multiplications. It does not require any additional book-keeping. Recently, in 
[5] a variant of Bl-GG was developed which treats deflation explicitly, but still 
uses matrix-block-vector multiplications. 

A round-off error analysis for Bl-BiGGStab in [3T] lead the authors to sug¬ 
gest a numerically more stable modification which basically interchanges the 
minimization step and the BiGG step present in each iteration of Gl-BiGGStab. 
A numerically more stable variant of Bl-BiGG which enforces A-orthogonality 
between the individual vectors and not just between block vectors, and which 
contains an additional QMR-type stabilization step, was considered in |2f)j . 


^This cure is somewhat cumbersome to implement, though. 
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4. Improvements: Cost and stability 

In any of the three simultaneous BiCG methods we are free to choose 
the initial block-vector of shadow residuals, popular choices being 
or If we take to have identical columns, i.e. 

^(0) ^ f(0) g 1 = (1,..., e C*, 

we see that in Gl-BiCG (Algorithm]^ all “shadow” block vectors and 
conserve this structure for all k, i.e. 

p(k) eC”. 

A comparable situation occurs neither in Li-BiGG nor in Bl-BiGG. Only in 
Gl-BiGG can we therefore take advantage of an initial shadow residual with 
identical columns and just compute the vectors and rather than the 
respective block vectors in iteration k—1. In particular, we then need to multiply 
only with a single vector instead of a whole block vector with s columns. 
The resulting “economic” version eGl-BiGG of Gl-BiGG is given as Algorithm]^ 
where we used the sum-operator to denote the sum of all components of a (row) 
vector and the relation tr(lf^P) = sum(f^P) for 1 G C®, f G C", R G C"^®. 


Algorithm 4: Economic global BiGG (eGl-BiGG) 

c 

P 

f( 

hoose e e C" 

ut p(o) ^ 5 _ AX^°\ P(0) = 
ar fc = 0,1, 2,... until convergence do 

Q{k) _ ^p(fc) 

= sum((f (^^)^p(^))/sum((p*^*^^)^(3^*^^) 

Xik+i) — xW p('=) 

R(k+1) = p(fe) _ a‘^k)Q{k)^ fik+1) ^ ^(fc) _ ^(^A^pik)-^ 

/?w = sum((f(''+i))'f^p('=+i))/sum((fW)^pW) 

p(fc-l-l) _ p(fc+l) _|_ p(k}^ pik+1) — fik+1) _|_ fj{k)p(k) 


In a similar way, we obtain an economic version, eGl-QMR, of the global 
QMR method. There is no such opportunity for Gl-BiGGStab, where multipli¬ 
cations with are not present anyway. In the economic global methods the 
work to be performed for the shadow residuals and search directions is identical 
to that for just one single r.h.s., so that it becomes increasingly negligible for 
larger values of s, s ^ 10 say. In this manner, the economic global methods 
eliminate one of the important disadvantages of the short recurrence Krylov 
subspace methods based on the non-symmetric Lanczos process, and we will see 
in the numerical experiments that eGl-BiGG and eGl-QMR perform favorably 
when compared with simultaneous BiGGStab variants, for example. 

Deflation in Bl-BiGG occurs if at some iteration k one of the block vectors 
p(^),p(^),p(^), and becomes rank deficient, even though one might have 






taken care for this not to happen in iteration 0. In practice, exact deflation 
happens rarely, but if one of the block vectors is almost rank deficient (inexact 
deflation), some of the matrices and will be computed inac¬ 

curately, resulting in numerical instabilities of the algorithm. Already in [23] it 
was therefore proposed to use a QR factorization of the search direction block 
vectors and to avoid such instabilities. A systematic study of different 
other remedies was presented in [5] for the Hermitian positive definite case, i.e., 
for the block CG method, and it turned out that the variant which uses a QR 
factorization of the residual block vectors is most convenient. This approach can 
be transported to block BiCG, and we performed a series of numerical compar¬ 
isons which shows that also in the non-Hermitian case the variant which relies 
on a QR factorization of the block residuals is to be preferred over the one with 
QR-factorization of the block search vectors. 

To precisely describe the resulting variant of block BiCG, consider a (thin) 
QR-factorization of the block residuals in Bl-BiCG (Algorithm]^, 

rW = qWcW, (3) 


where 

Q(fc),Q(fc) g = J, e 

A possible ill-conditioning of or translates into an ill-conditioned ma¬ 
trix or respectively, and Bl-BiGG can now be stabilized by using the 
QR-factorizations (|^ while at the same time avoiding the occurrence of 
and (C^^))”^. To do so, we represent the search direction block vectors 
and P^^'^ as 

p(k) _ y(k)(j(k) ^ p(k) _ y(k)pi(k)^ 

For the update of the block vectors and from Algorithm we then 
get 

y<,k+l) _ Q(k+1) J^y(k) p{k) 

y(.k+l) _ Q(k+1) _^y{k) ^p{k)^{k) ^yi{k+l)-^-l'^^ 

and using ([^ in the update for the block residuals we obtain 
Q{k+i)y<{k+i)f^y,{k)y^ = qW - 

Q(k+l) p(k+l) _ Q(k} _ j^Hy(k) ^p{k)^(k)^p{k)'^-iy 


This shows that computationally we can obtain ^ to¬ 
gether with from a QR factorization of , 

and similarly for the “shadow” block vectors. Moreover, we have 

= ((F('=))"AF('=))”^((Q('=))^Q('=), (4) 
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and, analogously, 


(7(fe)^(fe)(C'(fe+i))-i = (6) 

(7(fe)^(fe)((7(fe+i))-i = ((QW)^Q('=))”\s'('=+i))^((Q('=+i))^g(*^+i)). (7) 

Putting things together, we arrive at the block BiCG algorithm using QR- 
factorization of the block residual, termed Bl-BiCG-rQ, given as Algorithmic 


Algorithm 5: Block BiCG with QR factorization of residual block vectors 
(Bl-BiCG-rQ) 

c 

P 

0 

f< 

hoose e 

ut =B- AA(o), = 

)(0),t>(0) = Q(0) 

or fc = 1, 2, ... until convergence do 

VF('=) = AV'^^\ 

^(fe) = ((F(fc))^^]4/(fc))-i((g(fe))^^Q(G) 

+ F(GQ,(fe)c'(fe) 

Qik+l) g(k+l) _ Q{k) _ y[r(k)^(k) ^ (j(k+l) — g{k+l)(j(k) 

Q(k+l)g(k+l) _ Q(k) _ }^{k)g^(k) ^ ^(k+1) _ g(k+l)^{k) 

/?('=) = 

/3(G = ((Q('=))^Q(G)-i(5'(fc-ti))^^((Q(fc-ti))ffQ(fc+i)) 

y(k+l) _ Q<,k+l) _|_ y(k)p{k) ^ y<,k+l) _ Q(k+l) _|_ y(k)p(k) 


Several remarks are in order: First, in Algorithm [C we re-used the sym¬ 
bols etc. which now designate the quantities at the right of @-0- 

Second, although the block residuals are no more available explicitly in 
the algorithms, their 2-norms and Frobenius norms can still be easily retrieved 
and used in a stopping criterion, since both these norms are invariant under 
orthogonality transformations and thus 

p(fc)|| = ||Q(Gc'(fc)|| = ||C('=)||, 

being available from the algorithm. Third, Algorithmic requires more work 
than Algorithm |C mainly because of the additional two QR-factorizations of 
block vectors required in each iteration. They have at least a cost of 3ns'^+0{ns) 
each (using, e.g., the modified Gram-Schmidt algorithm). Finally, all block 
vectors in Algorithm |C now always have full rank, thus reducing one source 
of possible ill-conditioning in etc. In the case of A being Hermitian 

and positive definite, where BiGG can be reduced to CG, this is an exhaustive 
cure to possible ill-conditioning, see [9]. In BiGG, however, it can still happen 
that, e.g., or is ill-conditioned or singular. This is 
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inherent to the bi-orthogonality condition on which the whole method is built, 
and can be avoided only if one is willing to deviate from the bi-orthogonality 
condition by, for example, modifying the method using a look-ahead version of 
the unsymmetric block Lanczos process. 

Relying on the deflation procedure from [T] within a look-ahead block Lanc¬ 
zos process, the block QMR method from m addresses possible rank-deficiency 
in the block residuals as well as possible further breakdowns related to the bi¬ 
orthogonality condition. In this approach, the basis vectors for the next Krylov 
subspace are built one at a time and are then checked for deflation, so that this 
approach does not allow to compute matrix-block-vector products. 

To conclude this section, we remark that it is possible to, in a similar spirit, 
use a QR-factorization of the residuals in the block BiCGStab method. We 
do not write down the resulting algorithm, Bl-BiCGStab-rQ explicitly, but we 
will report on its performance in our numerical experiments. A corresponding 
variant based on QR factorization of the search directions performed less well 
in our experiments, as was the case in Bl-BiCG. 

5. Numerical experiments 

The purpose of this section is to assess the efficiency and stability of the 
various simultaneous methods. To this end we performed numerical experiments 
for five different examples in which systems with several r.h.s. arise naturally. 
These examples will be described in detail below. All iterations were started 
with = 0. In all examples we preprocessed the block vector B of r.h.s. by 
computing its QR-factorization, B = QR and then replaced B by Q. The initial 
shadow block residual was taken equal to the initial residual, (= Q), 

except for the economic versions where we took as the arithmetic mean of all 
initial residuals. The stopping criterion was always < 10“^° • ||i?. 

All experiments were run in Matlab on an Intel Gore 17-4770 quad core pro¬ 
cessor. Most of our Matlab code is based on pre-compiled Matlab functions 
and routines, e.g. for computing sparse matrix-vector products, factorizations 
or products of matrices. We therefore trust that the reported timings are sig¬ 
nificant with the possible exception of the loop interchanged methods where the 
explicitly occurring inner loop over the r.h.s. introduces a non-negligible part of 
interpreted code, thus increasing the compute time. 

For our Matlab experiments, the ratio a of the time required to perform s 
single matrix-vector multiplications and that for a matrix-block vector multi¬ 
plication with block size s ranges between 1.2 and 2.7. The benefit of working 
with block vectors is thus substantial. In our case, this gain is primarily due to 
the fact that Matlab uses all cores for block vectors, whereas it uses only one 
core for a single vector. On other processors and with other compilers similar 
gains do occur albeit sometimes for different reasons. We refer to for a re¬ 
cent publication where the benefits of matrix-block vector multiplications in the 
presence of limited memory bandwidth are investigated based on an appropriate 
“roofline” performance model. 
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Table 1: Number of r.h.s. s, measured acceleration factor a (ratio of the time for s matrix- 
vector multiplications to the time for one matrix-block vector multiplication with block size 
s), and value of p from for the numerical examples. 



Ex.| 
no prec. 

prec. 

Ex.[ 
no prec. 

prec. 

Ex. 1^ 
no prec. 

prec. 

Ex. 1^ 
no prec. 

!l 

prec. 

Ex.g 
no prec. 

s 

4 


19 


19 


12 


20 

a 

2.65 

1.21 

2.66 

1.69 

2.60 

1.71 

1.57 

1.18 

1.38 

p 

0.8 

0.36 

2.76 

1.29 

2.76 

1.29 

0.24 

0.14 

5.00 


The additional work to be performed in the block methods (see Proposi¬ 
tion]^ is of the order of O(s^) vector operations, s the number of right hand 
sides. This must be considered relative to the work required for s matrix-vector 
products, which amounts to s ■ imz jn vector operations, nnz being the number 
of non-zeros in A. Therefore, the ratio 


9 

s n 

p = - r =- 

s ■ nnz/n nnz 


( 8 ) 


can be regarded as a quantity measuring the (relative) additional cost for s r.h.s. 
in the block methods. If p is large, the additional arithmetic cost is substantial. 

Notwithstanding the detailed description of our examples below. Table 
reports the values of the various quantities just discussed for these examples. For 
Examples[T]|^we will also report results using ILU preconditioning. Besides from 
accelerating the convergence, preconditioning increases the cost of a matrix- 
(block)-vector multiplication which is now a preconditioned matrix-block vector 
multiplication. This affects the ratio a and reduces the ratio p. Table [^therefore 
gives values for the unpreconditioned and the preconditioned cases. Since none 
of the standard preconditioners was effective for Example we only report 
results for the unpreconditioned case there. 


5.1. Description of examples 

We considered the following five examples. 

Example 1. This is a variation of an example which was given in |19j in the 
context of Krylov subspace type methods for the Sylvester equation. We con¬ 
sider the elliptic pde 

’^xx ^yy T = 0, 

on the unit square with Dirichlet boundary conditions, where ai = = 5. 

We discretize using finite differences on an equispaced grid with 200 x 200 interior 
grid points. We build a total of four r.h.s. associated with the four corners of D. 
For a given corner, we construct the r.h.s. such that it represents the boundary 
condition of u being continuous and piecewise linear on the boundary with value 
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1 at that corner and value 0 at all the other corners. From the four solutions 
to these four r.h.s. we can thus, by superposition, retrieve the solutions for any 
continuous piecewise linear boundary values for u. 

This is an example of an elliptic pde with mild convection, the value for p 
from is relatively small, p = 0.8. 

Example 2. We consider the three-dimensional advection dominated elliptic 
pde on the unit cube 

'^XX “t“ Uyy “t" Uzz “t“ ^ * ILx - ^ - (O, l) , (9) 

where > 0 is a parameter which controls the influence of the convection term 
and / is defined such that for zero Dirichlet boundary conditions the exact 
solution is u{x,y,z) = exp(x?/ 2 :) sin(7ra;) sin(7rj/) sin(7rz), see [30] as an example 
where BiCGStab encounters convergence problems for larger values of ly. We 
discretize using finite differences on an equispaced grid with 50^ interior 
grid points. This results in a linear system of size n = 125,000 with seven 
non-zeros per row. We generate one r.h.s. by imposing zero Dirichlet boundary 
conditions on all faces of D. In order to be able to (by superposition) retrieve 
solutions for any Dirichlet boundary conditions which depend piecewise affinely 
on X, y and z on any of the six faces of D, we generate a total of 18 additional 
r.h.s. with / = 0 in To be specific, for the face with a; = 0, e.g., we set 
the boundary conditions equal to 0 on all other faces and obtain three r.h.s. 
by setting the boundary condition on that face equal to y, to z and to the 
constant 1, respectively, and similarly for the five other faces. Our choice for 
V is v = 1,000, as in [30] . resulting in a convection dominated equation and a 
highly non-symmetric system matrix. The value for p from Q is significantly 
larger than in the first example, p « 2.7. 

Example 3. Same as Example but now with v = 10. This means that the 
matrix is “almost” symmetric. 

Example 4. This example arises from quantum chromodynamics (QCD), the 
physical theory of the strong interaction between the quarks as constituents of 
matter. The Wilson-Dirac matrix Dw in lattice QCD represents a discretization 
of the Dirac operator from QCD, see m, e.g. One has Dw = I — kD with D 
representing a periodic nearest neighbor coupling on a four-dimensional lattice, 
and there are 12 variables per lattice point, one for each possible combination 
of four spin and three color indices. The couplings are defined via a gluon 
background field which is drawn from a statistical distribution and therefore 
varies irregularly from one lattice point to the next. A typical “propagator 
computation” in lattice QCD amounts to solve 

D\YXi Cj, t 1,..., 12, 

where ei,..., ei 2 are the first s = 12 unit vectors (corresponding to the twelve 
variables at one lattice point). For our computation we took a Wilson-Dirac 
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matrix based on a 8"^ lattice and thus has dimension 12-8^ = 49,152. Our 
configuration corresponds to a typical configuration in lattice QCD with tem¬ 
perature parameter /3 = 5.6 and coupling parameter k = 0.16066 corresponding 
to run ^3 in HIT]. Similar matrices can be found e.g. as conf 5_4-8x8-20 and 
conf 6_0-8x8-80 from the University of Florida Sparse Matrix Collection [^. 
The entries of Dw are complex numbers, the matrix is non-symmetric with its 
spectrum being symmetric to the real axis and located in the right half plane. 
There are 49 non-zeros per row in Dw- This is our example with the smallest 
value for p from (|^, p « 0.25. 

Example 5. The FEAST method to compute some eigenpairs [H] for the gen¬ 
eralized eigenvalue problem Ax = XBx evaluates the contour integrals j>p(A — 
tB)~^BYdt where F is a circle that contains the eigenvalues of the eigenpairs 
that are to be computed and Y G consists of m randomly chosen columns 

Ui. Using numerical quadrature for the integral, this means that one has to solve 
several linear systems {A — tjB)xi = yi for a given choice of the quadrature 
nodes tj. For our example, we took A as the matrix stemming from a model 
for graphene belonging to the parameter W = 200 in [14] with n = 40,000, 
and B = I. There, the eigenpairs corresponding to the eigenvalues of smallest 
modulus are sought, i.e. F is centred at 0 with a radius of 0.125. We solve the 
systems {A — tl)xi = yi for t = —0.0919 -I- 0.0848* (which corresponds to Z 3 in 
El) and 20 random right-hand sides yi. Here, the value for p from Q is p = 5, 
the largest value in our examples. 

Let us note that this example lends itself to a “shifted” Krylov subspace 
approach where systems are simultaneously solved for various values of tj, but 
this is less relevant in our context. 

5.2. Stabilization of block BiCG and block BiCGStab 

We first report on a comparison of the block BiCG method. Algorithm 
and block BiCGStab from |21j with the versions which improve stability using 
a QR-factorization of the block residuals. For block BiCG, this version is given 
explicitly in Algorithm]^ 

Figure[^shows convergence plots for Examplej^without preconditioning (top 
row) and with a (right) no-fill ILU preconditioner (bottom row) obtained via 
the Matlab function ilu. The left plots show the relative Frobenius norm of the 
block residual as a function of the total number of matrix-vector multiplications 
(there are 2s of those per iteration). The right plots show the same resid¬ 
ual norms, but now as a function of wall clock time. Figure shows that both, 
Bl-BiGG as well as Bl-BiCGStab, can significantly improve when the QR factor¬ 
izations are used, sometimes making iterations converge which otherwise would 
diverge. In the presence of preconditioning, where the preconditioned matrix 
is better conditioned and convergence is reached after a relatively small num¬ 
ber of iterations, QR factorizations has a less pronounced effect, and standard 
Bl-BiGGStab actually now converges, too. Since computing a QR-factorization 
of a block of s vectors of length n requires an arithmetic cost of O(s^) vector 
operations, this cost is substantial unless p is really small. From the plots in 
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time 




Figure 1: Convergence plots for Bl-BiCG and Bl-BiCGStab and their variants using a QR- 
factorization of the block residual. Top: Example^ bottom: Example^ with ILU precondi¬ 
tioning. 


the right column of Figure we indeed see that this additional cost is impor¬ 
tant and that more than outweighs the small gains in terms of the number of 
matrix-vector multiplications. 

Table summarizes the results for all our examples. For all four methods, 
it reports the number of matrix-vector multiplications and the time needed 
to reach the stopping criterion (reduction of the initial block residual by a 


Table 2: Iteration counts ^^it, wall clock times and final residual norms ||R|| for Bl-BiCG and 
its variant with QR-factorization of the block residual (top two rows), and their counterparts 
for Bl-BiCGStab (bottom two rows). 


variants of Bl-BiCG (first two rows) and Bl-BiCGStab (last two rows), no precond. 

Example 111 
time ||K|| 

Example 
#it time [f7t|| 

Example |3| 
time [[7x|| 

Example wl 

#it time [f7x|| 

Example |5| 

#it time \[B\\ 

500 2.28s div. 

500 3.85s le-08 

500 43.83s div. 
500 101.1s div. 

500 44.99s 0.28 
156 31.66s 2e-10 

500 62.92s le-03 
237 38.85s 3e-10 

528 36.49s 3e-10 

533 79.86s 3e-10 

500 2.46s 0.093 
355 2.96s le-10 

500 45.15s div. 
500 91.17s div. 

111 9.73s 2e-10 

101 18.10s 2e-10 

175 20.66s 2e-10 
157 25.07s 2e-10 

1200 77.72s 2e-3 
957 131.7s 3e-10 


variants of Bl-BiCG (first two rows) and Bl-BiCGStab (last two rows), ILU precond. 

Example 111 
time ||k|| 

Example 
#it time [f7?|| 

Example |3| 
time [[7t|| 

Example Kl 

#it time [f7t|| 


500 7.72s 0.04 
163 2.89s le-10 

28 6.13s 2e-10 

28 9.18s 8e-ll 

50 11.35s 3e-10 

50 17.06s 3e-10 

68 23.33s 2e-10 

62 23.72s 3e-10 


113 1.38s 3e-10 
115 1.69s 3e-ll 

16 2.74s 4e-ll 

16 4.09s 3e-ll 

33 5.87s 6e-ll 

33 8.84s 5e-ll 

41 10.15s 2e-10 

40 11.72s 7e-ll 
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factor of 10“^° or maximum of 500 iterations reached in Examples T]|4 1,200 
for Example . We also report the final relative norm of the block residual 
which was explicitly re-computed upon termination. If this final residual norm 
is larger than 1, we interpret this as divergence, noted as “div.” in the table. 
Smaller residual norms like 10“"^ may be interpreted as an indicator of slow 
convergence, too slow to be competitive within our setting. 

Table confirms our conclusions from the discussion of Eigure QR- 
factorization improves numerical stability, and it has the potential to turn oth¬ 
erwise divergent iterations into convergent ones. Its (relative) additional cost is 
indicated by the value of p from ([^ , and it is relatively small for Examples 
and 1^ whereas it is relatively high for the other examples where it thus does 
not pay off in the cases where the non-stabilized method is already conver¬ 
gent. The known convergence problems of BiCGStab for Example]^ see m, 
carry over to both variants of the block method, and we may conclude that the 
non-convergence is not a matter of numerical stability but of BiCGStab and 
its block counterpart not being able to efficiently accommodate the relevant 
spectral properties of the respective matrix; see also the discussion in [5D] . 


5.3. Comparison of all methods 

We proceed by comparing the number of matrix-vector multiplications and 
the wall clock times for all methods. Figure does so for the case without 
preconditioning. We use bar plots to allow for an immediate visual comparison 
of the methods and we group our measurements by “families” of methods. Since 
Example needs much more iterations than the other examples, we scaled the 
bar plots for this example by dividing by 5. A missing vertical bar indicates 
that the corresponding method either diverged or stagnated at a large residual 
norm. Instead of implementing a Bl-QMR method by ourselves we used a 
Matlab implementation by Freund and Malhotra which was publicly available 
at least until 2008 at the web site of Lucent technologies. This implementation 
is very cautious about possible ill-conditioning and (near) deflation, and for this 
reason it actually does all matrix-vector products one at a time, i.e. it does not 
use matrix-block vector products. As a result, this block QMR implementation 
can not be competitive with respect to timings, and for the unpreconditioned 
systems considered here we obtained premature break downs in all examples. 
The situation becomes slightly different in the preconditioned case, see below. 

We can make the following observations: The loop-interchanged and the 
global methods require almost the same number of matrix-vector multiplications 
in all examples. Since the other arithmetic work is also comparable, this should 
result in similar wall clock times. This is, however, not what we see, the loop 
interchanged methods requiring substantially more time. We attribute this to 
the fact that the loop interchanged methods are the only ones were there is a 
non-negligible amount of interpreted Matlab code. “Plain” block methods suffer 
from non-convergence in a significant number of cases (8 out of 10). Using a 
QR-factorization of the residuals reduces the number of failures to 2. When they 
work, the block methods reduce the number of matrix-vector multiplications as 
compared to loop-interchanged or global methods by never more than a factor 
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of 2. Because of the additional arithmetic cost, block methods never win in 
terms of wall clock time, and for larger values of p (Examples and they 
require more than twice the time than the corresponding global method across 
all “families”. The economic versions of the global methods always appear as 
the best within their respective “family”. They do not exhibit convergence 
problems, and they reduce the number of matrix-vector multiplications when 
compared with the global and loop interchanged methods. At the same time, 
their wall clock times are smallest. Overall, the economic global BiCG method 
appears to be the most efficient one, in tie with global BiCGStab for Example [l] 
and block BiGGStab for Example]^ Note that these are the examples with the 
smallest value for p. 

Figure shows the results for Examples T]|4 where we now use ILU (right) 
preconditioning. More precisely, we used Matlab’s ilu do obtain a no-fill ILU 
in Examples [l][^ and an ILU with threshold and pivoting with a drop tolerance 
of 5 • 10“^ in Example]^ The number of matrix-vector multiplications and the 
wall clock times decrease for all methods as compared to the un-preconditioned 
case. Block QMR now converges for Example but its wall-clock time is not 
competitive for the reasons explained earlier. The other block methods fail 
only once for BiGG, and never for BiGGStab. While the economic versions of 
the global methods still yield the best timings within their “family”, the block 
BiGGStab methods and global BiGGStab now perform better in terms of wall 
clock time for all four examples. The value of p being particularly small for 
Example]^ this is the example where block BiGGStab gains most against loop 
interchanged or global BiGGStab. 

Summarizing our findings we can conclude that one should use the block 
methods with particular care since there is a danger of non- convergence due to 
numerical instabilities. This can be somewhat attenuated by using the suggested 
QR-factorization of the block residuals. The additional arithmetic cost in the 
block methods should not be neglected, and—depending on p — there must be 
a substantial gain in matrix-vector multiplications in the block methods if this 
additional cost is to be outweighed. Global methods and loop interchanged 
methods require about the same amount of matrix-vector multiplications and 
additional arithmetic cost, so that they should require about the same time, 
too, if it were not for special effects in a Matlab implementation which mixes 
compiled and interpreted code. The economic versions of global BiGG and 
global QMR appear to perform well with respect to both, stability and efficiency. 
For better conditioned systems, e.g. when an efficient preconditioner is at hand, 
the block BiGGStab methods and global BiGGStab behave stably, and then 
their runtime is less than for the economic global methods. 


References 

[1] J. 1. Aliaga, D. L. Boley, R. W. Freund, V. Hernandez, A Lanczos-type 
method for multiple starting vectors. Math. Gomp. 69 (2000) 1577-1601. 


18 



matrix-vector multiplications until convergence matrix-vector multiplications until convergence matrix-vector multiplications until convergence 


BiCG family of methods, with ILU prec. 


BiCG family of methods, with ILU prec. 




QMR family of methods, with ILU prec. 


QMR family of methods, with ILU prec. 




BiCGStab family of methods, with ILU prec. 

3500 
3000 
2500 
2000 
1500 
1000 
500 

° Ex. 5.1 Ex. 5.2 Ex. 5.3 Ex 5.4 



BiCGStab family of methods, with ILU prec. 


^■Li-BiCGStab 
■ GI-BiCGStab 

□ BI-BiCGStab 

□ BI-BiCGStab-rQ 









■■nn 



I 


Ex. 5.1 Ex. 5.2 Ex. 5.3 


Figure 3: Number of matrix-vector multiplications and time for all methods, Examples |H4| 
with ILU preconditioning. 


19 






















































































































































[2] S. Birk, Deflated shifted block Krylov subspace methods for hermitian pos¬ 
itive definite matrices, Ph.D. thesis, Bergische Universitat Wuppertal, to 
appear (2015). 

[3] S. Birk, A. Frommer, A deflated conjugate gradient method for multiple 
right hand sides and multiple shifts, Numer. Algorithms 67 (2014) 507-529. 

[4] J. Cullum, T. Zhang, Two-sided Arnoldi and nonsymmetric Lanczos algo¬ 
rithms, SIAM J. Matrix Anal. Appl. 24 (2002) 303-319. 

[5] T. A. Davis, Y. Hu, The University of Florida Sparse Matrix Collection 
38 (1) (2011) 1:1-1:25. 

URL http://doi.acm.org/10.1145/2049662.2049663 

[6] L. Del Debbio, L. Giusti, M. Liischer, R. Petronzio, N. Tantalo, QCD with 
light Wilson quarks on fine lattices (I): First experiences and physics results 
arXiv 0702:056. 

[7] L. Del Debbio, L. Giusti, M. Liischer, R. Petronzio, N. Tantalo, QCD with 
light Wilson quarks on fine lattices (II): DD-HMC simulations and data 
analysis arXiv 0702:082. 

[8] J. J. Dongarra, J. Ducroz, 1. Duff, S. Hammarling, A set of level 3 Basic 
Linear Algebra Subprograms, ACM Trans. Math. Softw. 16 (1988) 1-17. 

[9] A. A. Dubrulle, Retooling the method of block conjugate gradients. Elec¬ 
tron. Trans. Numer. Anal. 12 (2001) 216-233. 

[10] R. Fletcher, Conjugate gradient methods for indefinite systems, in: G. Wat¬ 
son (ed.). Proceedings of the Dundee Biennial Conference on Numerical 
Analysis, Springer-Verlag, New York, 1975. 

[11] R. W. Freund, M. Malhotra, A block-QMR algorithm for non-Hermitian 
linear systems with multiple right-hand sides, Linear Algebra Appl. 254 
(1997) 119-157. 

[12] R. W. Freund, N. M. Nachtigal, QMR: A quasi-minimal residual method 
for non-hermitian linear systems, Numer. Math. 60 (1991) 315-339. 

[13] A. Frommer, K. Kahl, S. Krieg, B. Leder, M. Rottman, Adaptive aggre¬ 
gation based domain decomposition multigrid for the lattice Wilson Dirac 
operator, SIAM J. Sci. Comp. 36 (4) (2014) A158I-A11608. 

[14] M. Galgon, L. Kramer, J. Thies, A. Basermann, B. Lang, On the parallel 
iterative solution of linear systems arising in the FEAST algorithm for 
computing inner eigenvalues. Preprint BUW-IMACM 14/35, submitted 
(2014). 

URL http://www.imacm.uni-wuppertal.de/fileadmin/imacm/ 

preprints/2014/imacm_14_35.pdf 


20 


[15] C. Gu, Z. Yang, Global SGD algorithm for real positive definite linear 
systems with multiple right-hand sides, Appl. Math. Gomput. 189 (2007) 
59-67. 

[16] A. E. Guennouni, K. Jbilou, H. Sadok, A block version of BiCGSTAB 
for linear systems with multiple right-hand sides. Electron. Trans. Numer. 
Anal. 16 (2003) 129-142. 

[17] M. H. Gutknecht, Block Krylov space methods for linear systems with 
multiple right-hand sides: an introduction, in: A. H. Siddiqi, I. S. Duff, 
O. Christensen (eds.). Modern Mathematical Models, Methods and Algo¬ 
rithms for Real World Systems, Anamaya Publishers, New Delhi, India, 
2007, pp. 420-447. 

[18] M. Heyouni, The global Hessenberg and global CMRH methods for linear 
systems with multiple right-hand sides, Numer. Algo. 26 (2001) 317-332. 

[19] D. Y. Hu, L. Reichel, Krylov subspace methods for the Sylvester equations. 
Linear Algebra Appl. 174 (1992) 283-314. 

[20] K. Jbilou, A. Messaoudi, H. Sadok, Global FOM and GMRES algorithms 
for matrix equation, Appl. Numer. Math 31 (1999) 49-63. 

[21] K. Jbilou, H. Sadok, A. Tinzefte, Oblique Projection Methods for Linear 
Systems With Multiple Right-hand Sides, Electron. Trans. Numer. Anal. 
20 (2005) 119-138. 

[22] S. Karimi, F. Toutounian, The Block Least Squares Method for Solv¬ 
ing Nonsymmetric Linear Systems with Multiple Right-hand Sides, Appl. 
Math. Gomput. 177 (2006) 852-862. 

[23] D. P. O’Leary, The Block Conjugate Gradient Algorithm and Related 
Methods, Linear Algebra Appl. 29 (1980) 293-322. 

[24] E. Polizzi, Density-matrix-based algorithm for solving eigenvalue problems, 
Phys. Rev. B 79 (2009) 115112. 

[25] M. Rohrig-Zollner, J. Thies, M. Kreutzer, A. Alvermann, A. Pieper, 
A. Basermann, G. Hager, G. Wellein, H. Fehske, Increasing the perfor¬ 
mance of the Jacobi-Davidson method by blocking, submitted. 

URL http://elib.dlr.de/89980/ 

[26] V. Simoncini, A stabilized QMR version of block BiCG, SIAM J. Matrix 
Anal. Appl. 18 (1997) 419-434. 

[27] V. Simoncini, E. Gallopoulos, An iterative method for nonsymmetric sys¬ 
tems with multiple right-hand sides, SIAM J. Sci. Statist. Gomput. 16 
(1995) 9177933. 

[28] V. Simoncini, E. Gallopoulos, Convergence properties of block GMRES and 
matrix polynomials. Linear Algebra Appl. 247 (1996) 97-119. 


21 



[29] V. Simoncini, E. Gallopoulos, A hybrid block GMRES method for nonsym- 
metric systems with multiple right-hand sides, J. Comput. Appl. Math. 66 
(1996) 457-469. 

[30] G. L. Sleijpen, D. R. Eokkema, BICGSTAB(l) for linear equations involving 
unsymmetric matrices with complex spectrum, Electron. Trans. Numer. 
Anal. 1 (1993) 11-32. 

[31] H. Tadano, T. Sakurai, Y. Kuramashi, Block BiCGGR: A new block Krylov 
subspace method for computing high accuracy solutions, JSIAM Lett. 1 
(2009) 44-47. 

[32] H. A. Van Der Vorst, Bi-CGSTAB: A fast and smoothly converging variant 
of Bi-CG for the solution of nonsymmetric linear systems, SIAM J. Sci. 
Stat. Gomput. 13 (1992) 631-644. 

[33] B. Vital, Etude de quelques methodes de resolution de problemes lineaires 
de grande taille sur multiprocesseur, Ph.D. thesis, Univ. de Rennes I, 
Rennes (1990). 

[34] J. Zhang, H. Dai, J. Zhao, Generalized global conjugate gradient squared 
algorithm, Appl. Math. Comput. 216 (2010) 3694-3706. 

[35] J. Zhang, H. Dai, J. Zhao, A new family of global methods for linear systems 
with multiple right-hand sides, J. Comput. Appl. Math. 236 (2011) 1562- 
1575. 


22 



